PROBLEM

Repeated trials on a task are frequently required for assessing training
procedures or experimental treatments. Limited time, money, or availability
of research subjects often result in the need to give a substantial number of
trials on a task within a short period of time. However, in many laboratories
repeated measures are traditionally separated by 24 hours or more to reduce
the chances of fatigue, interference, or other factors introducing undesirable
error variance. Massing practice is an obvious alternative to distributing
it, particularly when time constraints exist. However, massed practice is
only a desirable alternative if the resulting test scores maintain the
statistical properties required for repeated measures analysis.

FINDINGS 

Paper-and-pencil and computerized versions of traditional human
performance tests were examined under massed practice conditions. Many of the
tests had been shown to have high reliabilities and to meet the statistical
requirements for repeated measures applications under distributed practice
conditions in earlier studies at our laboratory. The tests were: Grammatical
Reasoning, Pattern Comparison, Purdue Pegboard, Aiming, Spoke, Maze Tracing,
Code Substitution, Arithmetic, Stroop, and Memory Scanning. Although more
time was required for task stabilization in most cases, all of the
paper-and-pencil tasks retained high reliabilities under massed practice
conditions, except Pattern Comparison and Maze Tracing. The latter appeared to
have nonequivalent alternate forms. Computer adaptations of the tasks failed to
maintain the statistical properties required for repeated measures analysis.

RECOMMENDATIONS

It is recommended that distributed practice with trials separated by 24
hours or more be used whenever feasible. If massed practice is required, tasks
should be chosen which have been shown to have high reliability and which meet
the statistical requirements for repeated measures experimentation. It is
expected that once computer tasks are refined they too will lend themselves to
massed practice administration when required.


The authors wish to thank Richard Irons and Timothy Whitten for their
reliable computer programming/maintenance support. Special thanks to Robert
Carter and Alvah Bittner for sharing their data analysis expertise.

The volunteers used in this study were recruited, evaluated, and employed
in accordance with the procedures specified in Secretary of the Navy
Instruction Series 3900.39 and Bureau of Medicine and Surgery Instruction
Series 3900.6. These instructions are based upon voluntary consent, and meet
or exceed the provisions of prevailing national and international guidelines.

Trade names of materials or products of commercial or non-government
organizations are cited where essential for precision in describing research
procedures or evaluation of results. Their use does not constitute official
endorsement or approval of the use of such commercial hardware or software.
MASSED PRACTICE: DOES IT CHANGE THE STATISTICAL PROPERTIES

OF  PERFORMANCE  TESTS? 


INTRODUCTION 

Parallelism of measurements is an assumption underlying repeated measures
experiments (Jones, 1972; Lord & Novick, 1968; Winer, 1971). To ensure
parallelism, repeated measures are usually separated by several hours at the
least, and normally by 24 hours or more. Such long intervals between tests
are necessary to avoid fatigue, proactive interference, and difficulty
sustaining subjects' motivation. Given this, experiments which call for few
measurements per subject are feasible. However, experiments requiring many
repeated measures become impractical and often impossible. There are several
reasons why an extended study is undesirable. First, the internal validity of
experiments may be affected by extraneous events other than the experimental
variables occurring between measurements (Campbell & Stanley, 1963, p. 4). The
probability of this occurrence increases as the number of measurements and the
amount of time between measurements increase. Differential loss of
respondents from the comparison groups, and maturation of subjects and
apparatus, are two examples of extraneous events enumerated by Campbell and
Stanley which jeopardize the internal validity of prolonged repeated measures
experimentation.

In addition, repeated measures experimentation is relatively expensive in
terms of both experimenter and subject time. Nevertheless, the need for
repeated measures experimentation exists. Therefore, it is worthwhile to
investigate procedures which will yield comparable, reliable parallel
measurements.

Lord and Novick (1968) and Jones (1980) have identified the statistical
properties that tests should possess before they are used in repeated measures
experimentation. Briefly, they define the requirements as: (1) constant or
linearly increasing means across repeated measurements, (2) unchanging
variances, and (3) differential stability. Differential stability, as outlined
by Jones (1980), indicates that subjects' relative rank order is not changing
and consequently intersession correlations remain constant. That is, the task
should measure the same ability each time it is used. In addition, a task
must have sufficient definition (Jones, 1980). Task definition is indicated
by the averaged correlation across differentially stable trials. (See
Bittner, 1981; Bittner, Dunlap, & Jones, 1982; Dunlap, Jones, & Bittner, 1983,
for a justification of averaging correlations.)
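The three requirements and the definition index can be illustrated with a minimal sketch. It assumes a subjects-by-trials score table and uses a simple arithmetic mean of the pairwise correlations; the function names and the demonstration data are hypothetical, and the averaging procedure justified in the papers cited above may differ in detail.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def trial_summary(scores):
    """scores[s][t] is the score of subject s on trial t.

    Returns the per-trial means, the per-trial variances, and the simple
    average of all pairwise inter-trial correlations (used here as a rough
    stand-in for task definition).
    """
    trials = list(zip(*scores))  # one tuple of subject scores per trial
    means = [mean(t) for t in trials]
    variances = [variance(t) for t in trials]
    rs = [pearson(a, b) for a, b in combinations(trials, 2)]
    return means, variances, mean(rs)


# Hypothetical data: 4 subjects, 3 trials, rank order preserved across trials.
demo = [[10, 12, 14], [20, 22, 24], [30, 31, 34], [40, 43, 44]]
means, variances, avg_r = trial_summary(demo)
```

In this made-up example the means rise steadily, the variances stay close to one another, and the averaged correlation is near 1 because no subject changes rank.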

Several  human  performance  tests  which  meet  these  statistical  criteria 
have  been  identified  (Harbeson,  Bittner,  Kennedy,  Carter,  &  Krause,  1983; 
Kennedy,  Carter,  &  Bittner,  1980).  These  tests  were  examined  using 
distributed  trials  of  repeated  measurement.  The  current  study  utilized  a 
sample  of  those  tests,  for  purposes  of  investigating  whether  massed  practice 
yields  results  comparable  to  those  obtained  with  distributed  practice  (i.e., 
results  obtained  when  repeated  measures  were  collected  at  daily  intervals 
across  several  separate  testing  sessions).  Both  traditional  apparatus  and 
paper-and-pencil  versions  and  newly  programmed  computer  versions  of  classical 
tests were investigated. Comparable results between the two practice
schedules  would  indicate  that  massed  rather  than  distributed  practice  could  be 
given  to  stabilize  scores  so  that  parallel  repeated  measures  data  could  be 
collected. 
The purpose of this study was to determine whether people perform
comparably when given massed as opposed to distributed practice on tests. Tests
which met the statistical requirements for parallel repeated measures when
practice was distributed (i.e., trials were separated by ≥ 24 hours) were
given massed practice to determine whether the desired statistical properties
of the tests were again obtained.

METHOD 

Experiment 1

Subjects


Seventeen  Navy  enlisted  men  between  the  ages  of  18  and  25  were  subjects 
for  this  experiment.  All  subjects  were  volunteers  for  environmental  research 
experiments  and  met  the  health  qualifications  described  by  Thomas,  Majewski, 
Ewing,  and  Gilbert  (1978). 

Apparatus  and  Task  Descriptions 

Six  tests  of  cognitive,  spatial,  and  motor  ability  were  used  in  this 
study:  Grammatical  Reasoning,  Pattern  Comparison,  Purdue  Pegboard,  Aiming, 
Spoke,  and  Maze  Tracing.  Each  task  is  described  below. 

Grammatical  Reasoning.  This  task,  modeled  after  Baddeley's  (1968),  meets 
the  statistical  requirements  for  repeated  measures  testing  (Carter,  Kennedy,  & 
Bittner,  1981).  The  Grammatical  Reasoning  test  provides  a  measure  of  "higher 
mental  processes"  (Baddeley,  1968).  Subjects  were  asked  to  decide  whether  a 
statement  accurately  described  the  relative  position  of  two  letters  printed  to 
the  right  of  that  statement.  A  typical  item  would  look  like: 

A  is  preceded  by  B  BA  T  F 

The subjects were instructed to put a slash through the "T" if the statement
was true and a slash through "F" if the statement falsely described the letter
positions. Half of the statements were in the active voice (e.g., B follows
A) and half passive (e.g., B is followed by A). Additionally, half were
negative (e.g., A does not precede B) and half were positive statements (e.g.,
A precedes B). Twenty-four alternate forms, each with 32 items, were generated
by a FORTRAN program (see Carter & Sbisa, 1982, for the program listing). The
score recorded was the number correct minus the number incorrect for a
60-second administration.
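The item structure described above can be sketched as follows. This is not the FORTRAN program cited in the text; it is a minimal illustration which assumes that the 32 items arise from crossing subject letter, verb (precede/follow), voice, polarity, and letter pair. All function names are hypothetical.

```python
import random
from itertools import product


def make_items():
    """Build the 32 statement/letter-pair combinations with their truth values."""
    items = []
    for subj, verb, passive, negative, pair in product(
        "AB", ("precede", "follow"), (False, True), (False, True), ("AB", "BA")
    ):
        obj = "B" if subj == "A" else "A"
        if passive:
            participle = verb + ("d" if verb.endswith("e") else "ed")
            phrase = f"is {'not ' if negative else ''}{participle} by"
        else:
            phrase = f"does not {verb}" if negative else f"{verb}s"
        statement = f"{subj} {phrase} {obj}"
        # Does the subject letter actually come first in the printed pair?
        subj_first = pair.index(subj) < pair.index(obj)
        # Does the un-negated statement claim that the subject comes first?
        claims_subj_first = (verb == "precede") != passive
        truth = (subj_first == claims_subj_first) != negative
        items.append((statement, pair, truth))
    return items


def make_form(seed):
    """One alternate form: the 32 items in a seeded random order."""
    items = make_items()
    random.Random(seed).shuffle(items)
    return items


def score(n_correct, n_incorrect):
    """Number correct minus number incorrect, as recorded for this test."""
    return n_correct - n_incorrect
```

Under these assumptions each statement appears once with each letter pair, so exactly half of the items are true.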

Pattern Comparison. This test of perceptual speed was found to be
suitable for repeated measures experimentation (Klein & Armitage, 1979;
Shannon, Carter, & Boudreau, 1981). The object of this task was to determine
whether two patterns were the same or different. A typical "different" trial
looked like:

[Sample item not reproduced: a pair of differing asterisk patterns.]
Subjects were instructed to write an "S" on the dashed line if the patterns
were the same and a "D" if they were different. Subjects were given 144 total
problems and 2 minutes to do as many as they could. The score was the number
correct minus the number incorrect.

Purdue  Pegboard.  This  is  a  test  of  finger  dexterity  designed  by  Science 
Research  Associates,  Inc.  (Tiffin,  1968).  Subjects  were  instructed  to  place 
cylindrical  (2.5  mm  in  diameter)  pegs  into  sequential  holes  until  all  were 
filled,  or  until  the  maximum  time  limit  of  two  minutes  was  reached. 

Aiming.  This  is  a  test  of  fine  manipulative  ability  and  is  described 
more  fully  by  Fleishman  and  Ellison  (1962).  The  subject  was  required  to  make 
one  dot  in  each  of  a  series  of  very  small  circles  (3  mm  in  diameter),  working 
as  quickly  and  as  accurately  as  possible.  The  score  was  the  number  of  dots 
correctly  placed  in  2  minutes. 

Spoke. This task, which measures speed of lower arm movement, was
fashioned after the Reitan Trail Making Test (Form A). Investigations
indicate that this task is suitable for repeated measures use (Bittner, Lundy,
Kennedy, & Harbeson, 1982). The display sheets (43cm x 28cm) contained 32
circular targets arranged concentrically around a central circular target. Each
target was 9.5mm in diameter and located 120.6mm from the central target.
Distance from the center of one target to an adjacent target was 25.4mm. A
number was displayed in the center of each target. The subject was required
to alternately tap the stylus on the center target and each of the numbered
circles (i.e., 0, 1, 0, 2, ... 0, 32). The score recorded was time to
completion.

Maze  Tracing.  Ekstrom,  French,  Harman,  and  Dermen  (1976)  identify  this 
task as loading on a spatial scanning factor. It measures the ability to find
a  path  through  24  interconnected  mazes.  Variations  of  the  original  forms  of 
this  test  were  generated  by  Shannon  (personal  communication,  1982).  The  score 
was  the  number  of  blocks  completed  within  2  minutes. 

Procedure 

Testing  was  conducted  on  five  consecutive  weekdays,  between  the  hours  of 
7:30  and  11:30  in  the  morning.  Six  tables,  each  with  one  test  on  it,  were 
located  around  a  large  room.  Subjects  rotated  from  one  table  to  the  next  in  a 
different  random  order  on  each  day  until  they  had  completed  the  full  cycle. 

After each cycle, the order in which subjects took each test was
randomized. Eight replications of each test were administered on Day 1,
followed by four replications on each of Days 2-5. Subjects were tested in two
groups of five and one group of seven.
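Because the results refer to trials by their overall number (for example, Trials 16-24), it may help to see how the schedule above maps trial numbers onto days. The following is a small sketch under the stated schedule of eight replications on Day 1 and four on each later day; the function name is hypothetical.

```python
def trial_schedule(first_day=8, later_days=4, n_days=5):
    """Map each overall trial number to (day, trial within that day)."""
    schedule = {}
    trial = 0
    for day in range(1, n_days + 1):
        per_day = first_day if day == 1 else later_days
        for within in range(1, per_day + 1):
            trial += 1
            schedule[trial] = (day, within)
    return schedule


schedule = trial_schedule()
# 8 + 4 * 4 = 24 trials in all; Trial 9 is the first trial of Day 2.
```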


Experiment 2

Subjects

The subjects were 14 Navy enlisted men between the ages of 18 and 25.
All subjects were volunteers for the same environmental research program
specified in Experiment 1 and met the health qualifications.
Apparatus

Testing  equipment  included  APPLE  II  PLUS®  microcomputers  connected  to  and 
controlled  by  a  NESTAR  CLUSTER/ONE  MODEL  A®  central  networking  system.  This 
system  provided  for  simultaneous  testing  on  six  APPLE®  computers  and  automatic 
data  storage  from  each  testing  station.  Each  computer  was  equipped  with  64K 
memory,  an  interval  timing  clock  (Mountain  Hardware  Inc.),  and  an  APPLE® 
language card. Stimuli were presented on 13-inch screens. Four Zenith®
monitors and two Quasar® color TVs were used. In addition to the APPLE®
keyboard,  a  numeric  keypad  (Advanced  Business  Technology,  Inc.),  standard 
APPLE®  paddles,  and  a  three-button  box  built  in-house  served  as  response 
devices.  Hence,  input  to  the  subject  was  visual  while  manual  responses  were 
required. 

Task  Descriptions 

Six well-known tests of mental functioning, Code Substitution, Math,
Stroop, Memory Scanning, Grammatical Reasoning, and Pattern Comparison, were
used  in  this  study.  All  tests  were  programmed  in  Applesoft  Basic®  language, 
for  implementation  on  the  APPLE®  microcomputers.  Program  listings  for  these 
tests  are  available  from  the  authors  upon  request.  Computer  adaptations  of 
the  six  tests  used  in  this  study  are  described  in  detail  below. 


Code  Substitution.  This  test  is  conceptually  the  same  as  that  on  the 
Wechsler  Adult  Intelligence  Scale-Revised  (1980).  It  has  been  found  to  meet 
the  statistical  criteria  necessary  for  repeated  measures  testing  (Pepper, 
Kennedy,  Bittner,  &  Wiker,  1981).  Pairs  of  letters  and  numbers  were  presented 
to  the  subjects  and  their  task  was  to  respond  with  the  appropriate  digit  when 
a  code  letter  was  presented  alone.  Nine  letters,  generated  randomly  by  the 
computer,  were  paired  with  the  digits  one  (1)  through  nine  (9).  During  a 
trial,  one  of  the  nine  letters  would  print  in  the  top  three-quarters  of  the 
screen,  above  the  digit-code  pairs,  and  would  remain  until  the  subject  pushed 
one of the keys on a nine-key numerical pad. Throughout the three-minute
task, the coded pairs remained the same. The score was the number of correct
responses. A sample of the stimulus display, as it appeared on the screen, is:


CODE   A  M  N  T  S  R  Q  V  X
DIGIT  5  1  4  2  8  7  9  3  6
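The pairing and scoring logic can be sketched briefly. This is not the authors' Applesoft program; it assumes nine distinct letters drawn at random and paired with a shuffled set of the digits 1 through 9, and the function names are hypothetical. The sample table reproduces the display shown above.

```python
import random
import string


def make_code_table(rng):
    """Pair nine randomly chosen letters with the digits 1-9 in random order."""
    letters = rng.sample(string.ascii_uppercase, 9)
    digits = list(range(1, 10))
    rng.shuffle(digits)
    return dict(zip(letters, digits))


def count_correct(table, probes, responses):
    """Score: number of probe letters answered with the paired digit."""
    return sum(1 for letter, digit in zip(probes, responses)
               if table[letter] == digit)


# The sample display from the text, written as a lookup table.
SAMPLE = {"A": 5, "M": 1, "N": 4, "T": 2, "S": 8, "R": 7, "Q": 9, "V": 3, "X": 6}
```

For instance, answering 5, 2, 6 to the probes A, M, X against the sample table gives two correct responses, since M is paired with 1.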


Arithmetic. This test of arithmetic computation is similar to the Number
Facility tests described by Ekstrom, French, Harman, and Dermen (1976). Its
statistical reliability and stability indicate that it is an appropriate test
to use in repeated measures testing (Seales, 1980). This task included
addition, subtraction, multiplication, and division problems. Within blocks
of four, each type of problem was randomly presented once. In an attempt to
keep difficulty levels equivalent, addition was restricted to 3-by-3 and
3-by-2 problems, subtraction to 3-by-3 only, division to 1-by-3 and 1-by-4,
and multiplication to 1-by-2. The task was to perform the computation
mentally and enter the answer on a 13-key numerical pad. A marker on the
screen indicated where the subject should start keying in responses. Once a
response was typed it could be changed by pushing an "erase" key or entered,
allowing for the next problem to appear on the screen. Numbers were graphed
across the center of the screen using the low resolution graphics mode, and
each measured approximately 2.5cm x 2.5cm. A sample division problem is:


The  other  three  types  of  problems  looked  similar,  except  that  responses  were 
entered  from  right  to  left  on  addition,  subtraction,  and  multiplication 
problems  rather  than  from  left  to  right  as  in  the  division  problems.  Problems 
were presented consecutively, for two minutes, with approximately two seconds
between the subject's entering a response and the presentation of another.
The score was the number of correct responses for problems that were started
within the two-minute time frame.
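The block structure of the problem sequence (each of the four problem types appearing once, in random order, within every block of four) can be sketched as follows. Only the ordering is illustrated; operand generation is omitted, and the names are hypothetical rather than taken from the authors' program.

```python
import random

OPERATIONS = ("addition", "subtraction", "multiplication", "division")


def operation_sequence(n_blocks, rng):
    """Return a problem-type sequence in which every consecutive block of
    four contains each operation exactly once, in random order."""
    sequence = []
    for _ in range(n_blocks):
        block = list(OPERATIONS)
        rng.shuffle(block)
        sequence.extend(block)
    return sequence


sequence = operation_sequence(5, random.Random(1))
```

Blocking of this kind keeps the mix of problem types equal for any subject who completes a whole number of blocks, whatever their speed.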

Stroop. This test represents one of several versions of a serial verbal
task involving interference, which was designed by Stroop (1935, 1938). The
version used in this experiment was similar to one found to have the
statistical characteristics necessary for repeated measures testing (Harbeson,
Krause, Kennedy, & Bittner, 1982). Subjects were instructed to respond to
either a word or a color in this task. There were three conditions:
black-white (BW), color-word (CW), and color-color (CC). In the BW condition,
the words "RED", "BLUE", and "GREEN" were presented on the screen in random
order, and in black-and-white. Subjects were instructed to push buttons which
corresponded
to the word appearing on the screen. The words "RED", "BLUE", and "GREEN"
were also used in the CW and CC conditions, but were written in one of the
three colors so that the color might or might not have matched what the word
said. In the CW condition, subjects were instructed to ignore the color and
respond only to the word. Just the opposite was requested in the CC
condition; subjects were asked to ignore the word and respond to the color the
word was written in. Letters were graphed across the center of the screen,
using the high resolution mode, and were approximately 3.0cm x 4.0cm in size.
Subjects' responses were input on a three-button box, with one button being
red, one blue, and one green. In addition, the letters 'R', 'B', and 'G' were
typed above the corresponding buttons. Approximately one second elapsed between
trials within a condition, and each condition lasted 45 seconds. Each
condition was presented once within a sitting, but in random order each time.
The order remained the same across subjects. The score for this test was the
number correct minus a portion (.33) of the incorrect responses.
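The response rule for each condition and the scoring formula can be written out in a short sketch. The function names are hypothetical; the .33 penalty is the portion stated in the text.

```python
def correct_response(condition, word, color=None):
    """Return the button that counts as correct for one Stroop trial.

    BW and CW trials are answered according to the word; CC trials are
    answered according to the color the word is written in.
    """
    if condition in ("BW", "CW"):
        return word
    if condition == "CC":
        return color
    raise ValueError(f"unknown condition: {condition}")


def stroop_score(n_correct, n_incorrect, penalty=0.33):
    """Number correct minus a portion (.33) of the incorrect responses."""
    return n_correct - penalty * n_incorrect
```

For example, 30 correct and 3 incorrect responses give 30 - .33 x 3 = 29.01.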

Memory Scanning. Sternberg's (1966, 1975) information processing task
was used as a model in programming this test. A "target" set of one to four
digits was presented, immediately followed by a single "probe" digit. The
subject's task was to respond positively or negatively, depending on whether
the "probe" digit was one of the "target" digits. Stimuli were randomly
generated numbers from zero through nine. The numbers were graphed across the
center of the screen, using the low resolution graphics mode, and were 2.5 cm
x 2.5 cm each. Six trials at each target size (one through four) were
randomly presented, for a total of 24 trials per sitting. Average reaction
time to each trial was recorded. The usefulness of this test for repeated
measures applications is discussed by Carter, Kennedy, Bittner, and Krause
(1981), and Carter and Krause (1983).

Pattern  Comparison.  The  object  of  this  task  was  to  determine 
whether  two  patterns  presented  on  the  screen  (one  in  the  left  half  and  one  in 
the  right  half)  were  the  same  or  different.  High  resolution  graphics  were 
used  to  present  patterns,  each  composed  of  approximately  eight  dots,  on  each 
half  of  the  screen.  Points  were  randomly  selected  using  a  random  number 
generator,  and  dots  were  plotted  in  rapid  succession,  with  the  left  pattern 
appearing  slightly  before  the  pattern  on  the  right  side  of  the  screen. 

An  example  of  a  "different"  trial  is: 
Subjects responded by pushing one of two keys on the keyboard marked "S" and
"D". Approximately one second after each response, another set of patterns
would appear on the screen. The task lasted for three minutes. The score was
the number correct minus the number of incorrect responses.

Grammatical  Reasoning.  This  test  was  programmed  according  to  Baddeley's 
(1968) specifications, as described in Experiment 1. Thirty-two sentences
were  randomized  and  presented,  sequentially,  to  subjects  as  quickly  as  they 
could  respond  to  whether  the  statement  on  the  screen  was  true  or  false.  The 
intertrial  interval  was  about  one  second.  Responses  were  made  on  the  buttons, 
one  marked  "T"  and  the  other  "F",  of  standard  APPLE®  paddles.  The  task  ended 
automatically  at  the  end  of  two  minutes.  The  score  was  the  number  correct 
minus  the  number  of  incorrect  responses. 

Procedure 


A few days prior to the first day of the experiment, subjects were shown
the laboratory set-up and the basics of operating the apparatus.
Additionally, the instructions were reviewed and subjects were given practice
trials on each test. The experiment was conducted over a five-day period.
Within a session, subjects moved from one station to another completing each of
the tests, which were housed in six separate booths. On Day 1, testing was
completed within 150 minutes, and five replications were run on each test.
Three replications on each test, requiring about 75 minutes, were given on the
four subsequent days. Ten subjects were tested between the morning hours of
8:00 and 11:00; the remaining four subjects were tested between 12:30 and 3:30
in the afternoon.


RESULTS 

General  Analysis  Method 

The  initial  stage  of  analysis  for  both  experiments  included  checking  the 
data  for  outlying  and  missing  points.  Some  subjects'  data  were  eliminated 
from  the  analysis  on  this  basis.  The  total  number  of  subjects  included  in  the 
analysis  for  each  test  is  indicated  in  parentheses,  following  each  test  name. 
An  additional  step  in  the  initial  stage  of  analysis  was  to  calculate  for  each 
test  the  correlations  between  group  means  and  variances  to  determine  whether  a 
transformation  of  the  raw  data  was  necessary.  Transformations  used  are 
specified. 
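The mean-variability check can be illustrated with a minimal sketch. It assumes a subjects-by-trials table of positive scores and correlates the per-trial means with the per-trial standard deviations; the function names and demonstration data are hypothetical, and no particular decision threshold is implied.

```python
from math import log, sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def trial_means_and_sds(scores):
    """scores[s][t] is subject s on trial t; summarize each trial column."""
    trials = list(zip(*scores))
    return [mean(t) for t in trials], [stdev(t) for t in trials]


def mean_sd_correlation(scores):
    """Correlation, across trials, between group mean and group SD."""
    means, sds = trial_means_and_sds(scores)
    return pearson(means, sds)


def log_transform(scores):
    """Natural-log transform; only valid when every score is positive."""
    return [[log(v) for v in row] for row in scores]


# Hypothetical data in which spread grows in proportion to the mean.
raw = [[base * gain for gain in (10.0, 20.0, 40.0)]
       for base in (1.0, 2.0, 3.0, 4.0)]
r_raw = mean_sd_correlation(raw)
_, sds_log = trial_means_and_sds(log_transform(raw))
```

In this made-up case the raw means and standard deviations are perfectly correlated, and the log transform makes the standard deviations equal across trials.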

Secondly, Days X Trials repeated measures analyses of variance (ANOVAs)
were computed for the means and, separately, for the jackknife variance
estimates (Carter & Bittner, 1982) for each test. This provided for examining
the days effect and the trials-within-a-day effect, as well as their
interaction. Intersession correlations were analyzed sequentially by Steiger's
(1980) method, using the approach described by Bittner and Carter (1981), to
determine whether at any point in practice they ceased to change significantly.
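One common way to obtain per-subject values that can be entered into an ANOVA on variances is to compute jackknife pseudo-values. The sketch below shows that generic technique for a single trial; it is an assumption that this resembles the procedure of Carter and Bittner (1982), which is not reproduced here and may, for instance, work on transformed variances. The function name is hypothetical.

```python
from statistics import mean, variance


def jackknife_variance_pseudovalues(trial_scores):
    """Leave-one-out pseudo-values of the sample variance for one trial.

    trial_scores holds one score per subject (at least three subjects).
    Pseudo-value i is n * s2_all - (n - 1) * s2_without_i.
    """
    n = len(trial_scores)
    full = variance(trial_scores)
    pseudo = []
    for i in range(n):
        rest = trial_scores[:i] + trial_scores[i + 1:]
        pseudo.append(n * full - (n - 1) * variance(rest))
    return pseudo


# Hypothetical scores for five subjects on one trial.
scores = [12.0, 15.0, 9.0, 20.0, 17.0]
pseudo = jackknife_variance_pseudovalues(scores)
```

The pseudo-values average to the ordinary sample variance, so each subject contributes one variance-like observation per trial, and these can then be analyzed in the same Days X Trials layout as the raw scores.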

A  summary  of  the  test  administration  times,  scores  recorded  for  each  test 
and  stability  results  are  outlined  in  Table  1  for  Experiment  1  and  Table  2 
for  Experiment  2. 
Experiment 1


Grammatical  Reasoning  (N  =  17) 

The resulting means, standard deviations, and correlations for this test
are listed in Table 3. As indicated, group mean scores showed a linear trend
over days, after the initial four trials were dropped (F(4,64) = 8.50, p <
.001). Means across the four trials within each day were relatively
homogeneous (F(12,192) = .89, p > .50). Variances (listed in italics along
the diagonal in Table 3) were unchanged across days (F(5,80) = .69, p > .60)
and trials (F(3,48) = .50, p > .65). The Days X Trials interaction was also
nonsignificant (F(15,240) = .75, p > .70). Steiger's (1980) analysis method
indicated that intertrial correlations were stable across the last nine trials
(χ²(35) = 35.38, p > .45), with an averaged reliability of .83. The delayed
stabilization appeared to be due to two unusually high correlations (.95)
between trials 23, 24, and previous trials. Otherwise, as indicated by Table
3, intertrial correlations were relatively homogeneous across trials 10-24.

Pattern  Comparison  (N  =  17) 

A correlation of .71 between the means and standard deviations suggested
a log transformation of the raw data. The transformed group means increased
linearly over days subsequent to the first day (F(3,48) = 6.19, p = .001,
overall; F(1,16) = 8.08, p = .012, linear). The linear component accounted
for 88% of the total variation. Means remained relatively constant across the
last three trials of each day (F(2,32) = 2.15, p > .10). The Days X Trials
interaction was significant (F(6,96) = 3.98, p < .01). The interaction is due
to changes in subjects' relative intertrial performances with increased
practice. In the initial experimental days, practice effects were apparent
across trials within a day. However, by later days, performance on the second
trial was essentially the same as performance on the third and fourth trials.
As indicated in Table 4, intertrial correlations were essentially homogeneous
across the final seven trials (χ²(20) = 24.05, p > .24). The averaged
reliability across stable trials was .81. Overall, correlations tended to be
higher for adjacent trials and declined as trials became more separated in
time.

Purdue  Pegboard  (N  =  17) 

Group means were homogeneous across days, after dropping the initial four
trials on Day 1 (F(4,64) = .54, p > .70). Additionally, trials within a day
remained constant (F(3,48) = .67, p > .50). The Days X Trials interaction was
significant (F(12,192) = 2.27, p < .01); however, it failed to remain
statistically significant after Day 1 was dropped from consideration (F(9,144)
= 1.75, p > .08). Variances remained constant across all days (F(5,80) = .78,
p > .57) and across trials within each day (F(3,48) = 1.12, p > .35). The
Days X Trials interaction for variances was also nonsignificant
(F(15,240) = 1.60, p > .07). Intertrial correlations were stable across only
the last three trials (χ²(2) = 2.62, p > .26), with an averaged reliability of
.84. Table 5 indicates that both same-day and cross-day intertrial
correlations were intermittently high and low, with no obvious pattern. This
suggests that the relative order of subjects' performances continued to change
quite drastically until the last three trials.
Aiming (N = 17)

Means remained constant across days (F(4,64) = .6[?], p > .65) with the
first four trials excluded. Means within a day, across trials, showed
statistically significant change that was mainly due to a significant linear
trend (F(3,48) = 21.28, p < .001, overall; F(1,16) = 41.62, p < .001, linear).
When the initial trial each day was dropped, the significant linear trend
persisted, accounting for 95% of the significant change (F(2,32) = 5.36, p <
.01, overall; F(1,16) = 10.46, p < .005, linear). The linear trend in means
within each testing day was essentially the same across days, as indicated by
a nonsignificant Days X Trials interaction (F(8,128) = .47, p > .80).

Jackknife variance estimates remained relatively constant across days and
trials within each day (respectively, F(5,80) = 1.12, p > .35 and F(3,48) =
1.81, p > .15). In addition, there was a nonsignificant Days X Trials
interaction (F(15,240) = .61, p > .85). Intertrial correlations across Trials
16-24 were stable (χ²(35) = 42.49, p > .18) with an averaged reliability of
.82. The intertrial correlations fluctuated randomly, with more low
correlations occurring as trials were more separated by time (Table 6).

Spoke  (N  =  17) 

Group means across the five experimental days were significantly
different (F(5,80) = 19.23, p < .001). Means within each day, across trials,
also fluctuated significantly (F(3,48) = 18.37, p < .001). Group means did
remain constant, however, across the last four days, with the first trial of
each day excluded (F(3,40) = .65, p > .58). Within each day, means were
constant across the last three trials (F(2,32) = .02, p > .97). There was a
significant Days X Trials interaction, however (F(6,96) = 3.52, p < .01),
reflecting a definite nonlinear trend on Day 1 with an increasingly linear
trend across trials toward later days. Variances remained constant across days
(F(5,80) = .73, p > .60), and trials (F(3,48) = 2.19, p > .10). In addition,
their interaction was nonsignificant (F(15,240) = 1.50, p > .10). Intertrial
correlations were stable across the last eight trials (χ²(35) = 44.73, p >
.10) and the averaged reliability was .86. Table 7 indicates that
correlations failed to stabilize across trials 10-24 because of occasional
low correlations. However, intertrial correlations of trials earlier than 10
with later trials were more consistently low, particularly as they were more
separated in time.

Maze Tracing (N = 17)

An ANOVA on daily group means indicated a highly significant Days effect
when means were blocked across trials (F(5,80) = 72.64, p < .001). There was
no interpretable trend in the means; first through fifth order effects were
highly significant. This indicated that there was an erratic pattern in the
means across days. The Trials effect, blocked across days, was also highly
significant (F(3,48) = 15.23, p < .001). In addition, the Days X Trials
interaction was significant (F(15,240) = 35.95, p < .001). Figure 1
shows the unusual pattern in the means that underlies the highly significant
results. As can be seen in the graph, means increase across the first 15
trials, then drop drastically on Trial 16, and again increase linearly
throughout the remainder of the experiment. Fifteen alternate forms of this
test were used on the initial 15 trials of this experiment. On Trials 16-24
the first nine alternate forms were reused. The resulting means indicated
that the alternate forms were probably not equivalent; a linear increase of
difficulty across forms is suspected. This test is dropped from further
discussion since any findings would be overshadowed by the nonequivalence of
the alternate forms.


Figure  1.  Maze  Tracing:  Mean  number  completed  for  24  replications  over 
5  days  (n  =  17). 


Experiment  2 


Code  Substitution  (N  =  13) 

The ANOVA computed on the means revealed a significant Days effect
(F(4,48) = 13.70, p < .001). A substantial amount of this effect, 97%, was
attributable to the linear component (F(1,12) = 43.35, p < .001). The overall
Trials effect was nonsignificant (F(2,24) = 1.35, p > .27), as was the Days X
Trials interaction (F(8,96) = 1.41, p > .20). Thus, while within-day trial
means remained relatively constant, the group means across the five days
indicate a steady increase with practice. An ANOVA calculated on the
jackknife variance estimates indicated nonsignificant Days and Trials effects,
as well as a nonsignificant interaction (F(4,48) = 1.25, p > .30, F(2,24) =
2.23, p > .13, and F(8,96) = .85, p > .56, respectively). Jackknife variances,
therefore, remained constant across both Days and Trials. Intersession
correlations were differentially stable across Trials 11-15 (χ²(9) = 13.31,
p > .14); however, the average correlation across the stable trials (task
definition) was extremely low (r = .26). Hence, although the means increased
linearly and variance estimates remained constant over trials, this task
lacks reliability. As can be seen in Table 9, the intertrial correlations for
this task were generally low, even for trials given in succession. There is
no obvious reason for this finding, although the computer program might be
suspected.

Arithmetic (N = 12)

A significant Days effect appeared in the ANOVA on group means (F(4,44) =
5.85, p < .001). This effect was 99% linear (F(1,11) = 14.13, p < .003), with
none of the other components reaching statistical significance. The Trials
main effect, on the other hand, was nonsignificant (F(2,22) = 1.49, p > .24),
indicating that within-day trial means remained constant. The Days X Trials
interaction was also nonsignificant (F(8,88) = 1.59, p > .13). A comparable
ANOVA on the jackknife variance estimates indicated a significant Days main
effect when all five days were considered (F(4,44) = 2.68, p < .05). Both the
Trials main effect and the Days X Trials interaction were nonsignificant
(F(2,22) = .06, p > .94, and F(8,88) = .34, p > .94, respectively). Dropping
the initial day (Trials 1-3) from the analysis brought about nonsignificant
Days, Trials, and interaction effects (F(3,33) = 1.05, p > .38, F(2,22) = .06,
p > .94, and F(6,66) = .49, p > .80, respectively). Variances, therefore,
remained constant across Days 2-5 of the experiment. Intersession
correlations were differentially stable across all 15 trials, as evidenced by
the Steiger (1980) analysis (χ²(104) = 96.32, p > .69). The averaged
reliability of .55 indicated poor task definition when all 15 trials were
included; however, there was a trend toward increased reliability as the
initial trials were dropped from consideration. Averaged reliability across
Trials 11-15 was .77. This relatively low intertrial reliability, when
compared to the paper-and-pencil version, may be due to the fact that this
test includes all four numerical operations, which vary in difficulty for most
individuals. If the amount of time spent on each type of numerical operation
is variable, this could contribute to the overall instability of the test.
Recall that each type of problem was presented at random, once within each
block of four problems. Given that the test ended after approximately two
minutes, it was possible for the most difficult type of problem for any one
subject (i.e., the type of problem that required the most time for arrival at
an accurate solution) to outnumber the easier problems on one day but possibly
not the next day. This variability could be eliminated by changing the Math
test to include only one type of problem (e.g., addition) rather than all four
types represented here.
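
The mechanism described above can be illustrated with a short simulation. This is a minimal sketch, assuming a hypothetical subject with fixed solution times per operation and a 120-second limit. The time values, names, and seeds are illustrative assumptions, not data or software from the experiment.

```python
import random
from collections import Counter

OPERATIONS = ["add", "subtract", "multiply", "divide"]

# Hypothetical seconds needed per problem type for one subject (assumed values).
SOLUTION_TIME = {"add": 3.0, "subtract": 4.0, "multiply": 6.0, "divide": 9.0}


def run_session(seed: int, time_limit: float = 120.0) -> Counter:
    """Count problems completed per operation in one timed session.

    Each block of four problems contains every operation once, in random
    order; the session stops when the next problem would exceed the limit.
    """
    rng = random.Random(seed)
    elapsed = 0.0
    completed = Counter()
    while True:
        block = OPERATIONS[:]
        rng.shuffle(block)                    # random order within the block
        for op in block:
            if elapsed + SOLUTION_TIME[op] > time_limit:
                return completed              # time ran out mid-block
            elapsed += SOLUTION_TIME[op]
            completed[op] += 1


# Three simulated sessions for the same subject, differing only in ordering.
for day, seed in enumerate([1, 2, 3], start=1):
    print(day, dict(run_session(seed)))
```

Because the final, incomplete block is cut off at a point that depends on the random ordering, the mix of problem types completed can differ from session to session even though the simulated subject's ability is unchanged.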


Stroop  (N  =  9) 


An ANOVA computed on the means for all days revealed a significant Days X
Trials interaction (F(8,64) = 2.13, p < .05). This significant interaction
impeded interpretation of the main effects, although neither the Days nor
Trials effect was significant (F = 1.75, p > .16 [degrees of freedom
illegible] and F(2,16) = 1.60, p > .23, respectively). When the first day
was dropped from the analysis, the Days X Trials interaction was
nonsignificant (F(6,48) = 1.58, p > .17). Again, the Days and Trials main
effects were also nonsignificant (F(3,24) = .80, p > .51 and F(2,15) = .20
[p value illegible], respectively). Hence, after the first testing day, means
remained unchanged. Jackknife variance estimates also remained constant over
all Trials and Days as indicated by an ANOVA which included all observations.
The statistical values for Days and Trials main effects and the Days X Trials
interaction were respectively F(4,32) = .09, p > .98, F(2,16) = .29, p > .74,
and F(8,64) = 1.98, p > .06. This indicates that variances for the Stroop CW
score remain constant across practice trials. Steiger analysis on the
intertrial correlations revealed an averaged reliability of .28 across the 15
trials, although the correlations were differentially stable (χ²(104) =
97.09, p > .67). When initial trials were dropped and only Trials 12-15 were
examined, the averaged reliability rose to .49. There was an inherent problem
in the computer version of the Stroop test, which surfaced while subjects
were doing the test and which may account for the unreliable intertrial
correlations. After the instructions appeared on the screen and the test
began, the subjects frequently forgot whether they were supposed to attend to
the colors the words were written in, or attend to the words, disregarding
the color. That is, they were often unable to remember the instructions
pertaining to the specific task once the instructions left the screen and the
test began. Although only the color-word condition was scored, the subjects
were asked to do both types of tasks, at different times. Confusion could be
eliminated by administering only one type of Stroop task.


Memory Scanning

The ANOVA on group means for all observations showed a significant main
effect for Days (F(4,48) = 5.32, p < .01). Thirty-six percent of the total
sums-of-squares was attributed to the linear component (F(1,12) = 6.82, p >
.02), while 61% was quadratic (F(1,12) = 17.09, p < .01). The Trials main
effect was not statistically significant (F(2,24) = .69, p > .50), and neither
was the Days X Trials interaction (F(8,96) = .59, p > .78). Excluding the
initial testing day resulted in stable, unchanging means for the remaining
Days (F(3,36) = .60, p > .61) and Trials (F(2,24) = 1.11, p > .34). The Days
X Trials interaction was again nonsignificant (F(6,72) = .58, p > .74).
Analysis of the jackknife variance estimates showed that they remained constant
across all Trials (F(2,24) = .14, p > .87) and all Days (F(4,48) = 1.18, p >
.32) of the experiment. There was no significant interaction between Days and
Trials (F(8,96) = .71, p > .68). Intertrial correlations were differentially
stable across Trials 6-15 (χ²(44) = 54.70, p > .12) with an averaged
reliability of .78. Table 10 shows that between Trials 1-5 and all other
trials there was an occasional high or low intertrial correlation which kept
the correlations from reaching statistical equivalence. Other than that, the
intertrial correlations are relatively homogeneous throughout.
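
The split of a Days effect into linear and quadratic shares can be sketched with orthogonal polynomial contrasts. The example below assumes five equally spaced days and equal cell sizes, and it treats each percentage as the component's share of the between-days sum of squares. The daily means and the function name are hypothetical and are not data from this report.

```python
# Orthogonal polynomial contrast coefficients for five equally spaced levels.
CONTRASTS = {
    "linear":    [-2, -1,  0,  1, 2],
    "quadratic": [ 2, -1, -2, -1, 2],
    "cubic":     [-1,  2,  0, -2, 1],
    "quartic":   [ 1, -4,  6, -4, 1],
}


def trend_shares(day_means):
    """Return each trend component's share of the between-days sum of squares.

    With equal n per day, n cancels, so the shares depend only on the means.
    """
    grand = sum(day_means) / len(day_means)
    ss_days = sum((m - grand) ** 2 for m in day_means)
    shares = {}
    for name, coefs in CONTRASTS.items():
        psi = sum(c * m for c, m in zip(coefs, day_means))   # contrast value
        ss_component = psi ** 2 / sum(c * c for c in coefs)  # contrast SS / n
        shares[name] = ss_component / ss_days
    return shares


# Hypothetical daily means: rapid early gain that levels off.
means = [40.0, 46.0, 49.0, 50.0, 50.0]
for name, share in trend_shares(means).items():
    print(f"{name:<9} {share:.1%}")
```

Because the four contrasts are mutually orthogonal and exhaust the four degrees of freedom for Days, their shares sum to one.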


Pattern  Comparison 


Group means remained constant across all observations, as indicated by
the analysis of variance. The statistical values for the Days, Trials, and
interaction effects were respectively F(4,36) = 1.91, p > .12, F(2,18) = .33,
p > .72, and F(8,72) = 1.20, p > .30. An ANOVA on the jackknife variance
estimates revealed that they also remained essentially unchanged across the
course of the experiment (Days: F(4,36) = 1.90, p > .13; Trials: F(2,18) =
1.53, p > .24; Days X Trials: F(8,72) = .48, p > .86). Intertrial
correlations were differentially stable across Trials 2-15 (χ²(90) = 105.85,
p > .12), with an averaged reliability of .47. Reliabilities fell off,
reaching approximately .27 when only the last four trials were considered.
Table 11 indicates that intertrial correlations were moderate to low, with no
apparent pattern, throughout the matrix.
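
This section does not spell out how the jackknife variance estimates were formed. One common construction, shown here only as an assumed illustration, computes leave-one-out pseudovalues of the log sample variance for each subject within a cell; the pseudovalues can then be analyzed like ordinary scores. The function name and the scores are hypothetical.

```python
import math
from statistics import variance


def jackknife_log_variance_pseudovalues(scores):
    """Leave-one-out pseudovalues of log(sample variance), one per subject.

    Assumed construction: pseudo_i = n*log(s^2) - (n-1)*log(s^2 without i).
    Requires at least three scores and nonzero variance in every subset.
    """
    n = len(scores)
    full = math.log(variance(scores))
    pseudovalues = []
    for i in range(n):
        rest = scores[:i] + scores[i + 1:]    # drop subject i
        pseudovalues.append(n * full - (n - 1) * math.log(variance(rest)))
    return pseudovalues


# Hypothetical scores for ten subjects on one trial (not data from the report).
trial_scores = [12.0, 15.0, 9.0, 14.0, 11.0, 13.0, 10.0, 16.0, 12.0, 14.0]
pseudo = jackknife_log_variance_pseudovalues(trial_scores)
print([round(p, 2) for p in pseudo])
```

Computing one pseudovalue per subject for every Days by Trials cell yields a data matrix with the same layout as the raw scores, so the same repeated measures ANOVA can be applied to it.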


Grammatical Reasoning (N = 13)


All data analyses on this test include Days 2-5; Day 1 was excluded
because the data were lost in computer transmission. An ANOVA on the group
means showed that the main effects were nonsignificant, with F(3,36) = .51,
p > .67 for Days and F(2,24) = .26, p > .77 for the Trials effect. In
addition, the Days X Trials interaction was also nonsignificant (F(6,72) =
.56, p > .75). Therefore, group means remained constant across the final
four experimental days. An ANOVA on the jackknife variance estimates
suggested that there was no statistically significant change in the variances
across Days (F(3,36) = .16, p > .91) or Trials (F(2,24) = .36, p > .69).
Additionally, their interaction was nonsignificant (F(6,72) = .69, p > .66).
Steiger analysis of the intertrial correlations indicated that Trials 8-15
were differentially stable (χ²(27) = 31.24, p > .26), with an averaged
reliability of .85. As indicated in Table 11, intertrial correlations were
moderate to high, and relatively homogeneous throughout. However, two low
correlations (between Trials 12 and 7, and 14 and 7) apparently prevented the
matrix of intertrial correlations from being statistically homogeneous prior
to Trial 8.
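
The effect of dropping early trials on the averaged reliability can be sketched as follows. The example assumes the averaged reliability is the simple arithmetic mean of the off-diagonal intertrial correlations within the chosen window; the original analysis may have averaged differently. The matrix values and the function name are hypothetical.

```python
def averaged_reliability(corr, first_trial, last_trial):
    """Mean off-diagonal correlation among trials first..last (1-based)."""
    idx = range(first_trial - 1, last_trial)
    pairs = [(i, j) for i in idx for j in idx if i < j]
    return sum(corr[i][j] for i, j in pairs) / len(pairs)


# Hypothetical 4 x 4 intertrial correlation matrix: Trial 1 relates weakly
# to the later trials, which correlate highly with one another.
corr = [
    [1.00, 0.40, 0.35, 0.45],
    [0.40, 1.00, 0.80, 0.85],
    [0.35, 0.80, 1.00, 0.90],
    [0.45, 0.85, 0.90, 1.00],
]

print(round(averaged_reliability(corr, 1, 4), 3))  # all four trials
print(round(averaged_reliability(corr, 2, 4), 3))  # Trial 1 dropped
```

In this illustration the average rises once the weakly related first trial is excluded, which is the pattern described when initial trials are dropped from consideration.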


DISCUSSION 

The results, summarized in Tables 1 and 2, indicate that when mass
practiced, most tests either lose or take longer to achieve the statistical
properties required of tests used for repeated measures applications.
Interesting comparisons can be made between the massed and distributed-practice
results, and likewise between paper-and-pencil and computer massed-practice
results. These comparisons will be discussed in turn.

Daily group means and variances for mass practiced computer tasks
generally stabilize early in relation to distributed practice results for the
equivalent paper-and-pencil tests (Table 2). However, the correlational
results for mass practiced computer tasks are disappointing. Overall, the
intertrial correlations and consequent averaged reliabilities indicate a lack
of task definition. That is, when these particular tests are subjected to
massed practice, the attribute(s) being measured change from one trial to the
next. One exception appears to be Grammatical Reasoning, which reaches an
acceptable level of reliability (.85), although it takes longer to attain
stability when mass practiced. The computer version of Grammatical Reasoning
appears to yield a higher averaged reliability (for the same amount of test
time) when mass practiced than our paper-and-pencil version does when practice
is distributed (.85 vs. .80).

The apparatus and paper-and-pencil massed practice and distributed
practice results are compared in Table 1. Generally, the group means take
longer to stabilize when mass practiced. Variances, on the other hand,
remained constant across trials, except for Pattern Comparison, whose
variances never stabilized. Intertrial correlations took longer to stabilize
in all cases when mass practiced; however, the majority of the tasks reached
an acceptable averaged reliability when they finally became homogeneous (all
above .81).

In summary, the paper-and-pencil tasks examined, except Pattern Comparison,
retain the statistical properties required for repeated measures applications
when mass practiced. Both group means and intertrial correlations require
substantially longer to stabilize, however, when mass practiced than when
practice is distributed. With the exception of Grammatical Reasoning, the
computer tasks failed to reach an acceptable level of reliability, and
therefore are not suited for use in their present forms. Grammatical
Reasoning in its computer form is not as reliable as the paper-and-pencil
version, but might be acceptable for some applications because of its
convenience. A change that may improve the reliability of the computer tasks
is to reduce (ideally remove) the opportunity for subjects to make ambiguous
or unintentional responses.

Paper-and-pencil tests have a long history and have continually been
refined and improved over time. Computer adaptations of traditional tests are
relatively new, and therefore it is reasonable that time and effort may need
to be expended before they have the reliability and construct validity of
their traditional counterparts. If a serious effort is made to continually
scrutinize and improve computer tasks, in the future we may have a way of
testing abilities that is superior to the traditional testing approach.
Computers may reduce experimental error by functioning as consistent,
reliable test administrators. An added advantage is that a computer can score
and analyze tests quickly and accurately. These factors are merely a few
which contribute to the promise that computerized testing holds for the
future. In order for computerized testing to provide meaningful results,
however, we must improve the reliability of computer testing procedures,
hardware, and software.

The evidence reviewed leads us to conclude that distributed practice should
be used when possible. In situations where economic or other constraints
dictate that massed practice is necessary, well-established paper-and-pencil
and apparatus tests which yield high reliability should be used. A sufficient
number of trials (more than 20 in most cases) should be given to ensure that
stability is reached prior to repeated measures use. New computer tasks
should be scrutinized carefully for factors leading to unreliability and
instability prior to their use.

Cross-session correlations, daily group means and standard
deviations for Experiment 1 - GRAMMATICAL REASONING*


Group means (left margin), in order of appearance:
10.12, 10.88, 11.29, 10.94, 12.82, 12.65, 13.65, 15.00, 13.76, 13.70, 15.18,
14.23, 16.06, 15.18, 15.53, 16.18, 15.65, 16.12, 16.29, 16.35, 16.59, 15.94


*Group means are along the left margin, standard deviations along the diagonal
(in italics) and correlations within the upper and lower triangles

Table 4: Cross-session correlations, daily group means and standard
deviations for Experiment 1 - PATTERN COMPARISON*


Group  means  are  along  the  left  margin,  standard  deviations  along  the  diagonal 
(in  italics)  and  correlations  within  the  upper  and  lower  triangles 




Table 5: Cross-session correlations, daily group means and standard deviations for [test name illegible]*


Group means (left margin), in order of appearance:
51.35, 47.82, 47.18, 43.65, 44.35, 43.06, 41.18, 43.29, 42.41, 43.59, 45.59, 41.94,
43.23, 42.23, 42.41, 42.65, 42.53, 42.06, 42.53, 44.29, 42.29, 41.00, 42.52, 41.94

Correlations and standard deviations, in order of appearance:
.89, 24, 7.13,
-.40, -.46, -.19, -.19, -.27, -.20, .11, .04, -.20, -.04, .18, .21, [illegible],
.22, .17, -.02, .06, .09, -.19, -.13, .01, .36, .14, .16, .01, .35, .12, .10,
-.07, .09, .03, .01, .53, .39, .21, .32, .16, .40, .37, .12, .16, .37, .56,
.47, .40, .53, .50, .37, .28, .37, .67, .39, .53, .57, .61, .50, .33, .49,
.67, .39, .1.1, .33, .43, .19, .39, .48, .68, .52, .56, .52, .49, .51, .66,
.74, .65, .65, .75, .71, .67, 4.56, .46, .68, 3.29, 4.01, .77, 5.50, 6.95,
.41, i-86, 4.76, .55, .32, .-98, .53, .29, .11, .51, .67, .81, .55, 3.80

-.19, -.33, -.28, -.05, .08, .19, .10, .22, .34, .32, .05, .47, .02, -.15,
.23, .52, .19, .42, .34, -.04, .37, .40, -.03, .24, .39, .40, .79, .54, .47,
.60, .46, .78, .54, .65, .62, .83, .32, .85, .63, .51, .75, 5.18, [illegible],
2.98, 6.44, 4.34, 4.14, .74, 4.09, .33, .23, .63, .06, .15, .67, .25, .14,
.83, .17, .09, .61, .07, .16, .50, .11, -.04, .75, .18, .04, .15, .45, .29,
.60, .24, -.04

*Group means are along the left margin, standard deviations along the diagonal
(in italics) and correlations within the upper and lower triangles




Table  6:  Cross-session  correlations,  daily  group  means  and  standard  deviations  for 
Experiment  1  -  AIMING* 

24  13 


Mean     Row entries, in order of appearance
258.06   l, .67, .72, .76, .71, .71, .59, .66, .59, .78, .78, .81, .78
284.41   .40, .34, .25, .24, .35, .40, .28, .29, .36, .32, .56, .66
301.29   .50, .36, .42, .37, .41, .47, .42, .37, .58, .65, .64, .74
305.82   .67, .63, .63, .48, .66, .60, .60, .60, .68, .69, .74, .80
326.41   .65, .52, .47, .45, .55, .67, .61, .58, .70, .66, .67, .82
327.35   .73, .66, .57, .48, .57, .71, .68, .63, .76, .68, .68, .92
328.88   .56, .48, .42, .46, .49, .68, .58, .54, .65, .61, .60, .84
330.29   .44, .45, .49, .53, .51, .47, .46, .38, .62, .62, .57, .69
305.76   .64, .49, .47, .45, .52, .58, .56, .49, .55, .52, .58, .83
324.23   .74, .63, .60, .59, .60, .68, .64, .63, .74, .72, .80, .87
338.82   .77, .68, .62, .58, .70, .71, .73, .71, .71, .66, .75, .88
337.18   .78, .75, .73, .61, .77, .74, .73, .76, .75, .77, .88, .79
330.00   .82, .79, .72, .61, .73, .84, .83, .79, .84, .77, .82, 46.18
334.23   .78, .78, .81, .76, .73, .75, .75, .82, .84, .84, 35.45, 53.53
333.23   .68, .66, .76, .77, .69, .73, .78, .73, .91, 42.51, 49.66, .58
338.12   .82, .81, .79, .80, .76, .85, .87, .86, 44.84, 39.40, .82, .66
306.70   .83, .89, .8C, .72, .83, .90, .89, 42.21, 43.64, .90, .76, .74
322.94   .87, .88, .79, .72, .87, .91, 46.47, 48.64, .92, .93, .79, .63
330.18   .78, .79, .68, .68, .76, 39.41, 48.29, .92, .87, .84, .72, .73
332.88   .83, .89, .86, .75, 59.71, 42.56, .91, .90, .81, .84, .79, .65
310.82   .76, .73, .85, 53.64, 53.23, .69, .65, .62, .58, .69, .66, .79
326.59   .86, .89, 57.70, 40.68, .60, .87, .83, .80, .74, .78, .75, .53
333.76   .92, 59.57, 42.11, .85, .71, .88, .91, .95, .90, .91, .83, .72
339.29   24, 54.37, 40.97, .92, .86, .65, .90, .88, .90, .86, .80, .76, .68,
         i, 43.34, .80, .86, .64, .49, .66, .75, .82, .90, .76, .65, .70

12  1 
*Group means are along the left margin, standard deviations along the diagonal
(in italics) and correlations within the upper and lower triangles

Table 7: Cross-session correlations, daily group means and standard
deviations for Experiment 1 - SPOKE*


.43  .35
.35  .26
.48  .36
.48  .41
.43  .39
.47  .38
.63  .56
.55  .45
.49  .42
.68  .55
.68  .58
.76  .65
.73  .67

.42  .51  .51  .54  .44  .43  .65  .57
.39  .39  .49  .54  .38  .44  .71  .67
.37  .42  .53  .57  .46  .46  .68  .69
.31  .40  .52  .58  .48  .53  .70  .68
.31  .41  .50  .58  .47  .55  .77  .67
.41  .50  .62  .70  .52  .56  .80  .79
.57  .63  .68  .61  .64  .66  .73  .67
.48  .52  .68  .72  .55  .55  .77  .77
.47  .51  .55  .58  .52  .58  .72  .65
.59  .64  .76  .72  .66  .62  .81  .75
.60  .68  .78  .88  .63  .67  .89  .94
.64  .72  .81  .83  .65  .67  .86  .86
.80  .80  .87  .84  .81  .79  .89  .85

.74  .66
.71  .77  .83  .93  .70  .78
.67  .68
.73  .74  .75  .84  .74  .85
.78  .86
.84  .85  .74  .79
-08  .90
.83  .88
.91  .92  .83  .83
.91  .77
.85  .81
.81  .89
-08  .95  .85  .79
.89  .78
.93  .92
.84  .85
.85  .93
.88  .92
-08  .93  .88  .85  .82
-08  .82  .85  .83  .82  .72
.82  .92  .89  .86  .77  .65
•06  .79  .87  .85  .81  .73  .73  .70
.76  .87  .91  .85  .88  .87  .86  .78
.67  .83  .70  .88  .78  .78  .82  .80
-08  .94
.74  .86  .85  .86  .83  .82  .86  .79


Group  means  are  along  the  left  margin,  standard  deviations  along  the  diagonal 
(in  italics)  and  correlations  within  the  upper  and  lower  triangles 
*Standard deviations are along the diagonal (in italics), means are along the right and left margins, and
correlations are within the upper and lower triangles


Cross-session correlations, daily group means and standard deviations for Experiment 3 - PATTERN
COMPARISON (upper diagonal and right margin) & GRAMMATICAL REASONING (lower diagonal and left margin)*
*Standard deviations are along the diagonal (in italics), means are along the right and left margins, and
correlations are within the upper and lower triangles


